Abstract: Latent Dirichlet allocation (LDA) is an important hierarchical Bayesian model for probabilistic topic modeling, which
has attracted worldwide interest and touches on many important applications in text mining, computer vision and computational
biology. This paper represents LDA as a factor graph within the Markov random field (MRF) framework, which enables the classic
loopy belief propagation (BP) algorithm for approximate inference and parameter estimation. Although two commonly used
approximate inference methods, variational Bayes (VB) and collapsed Gibbs sampling (GS), have been very successful in learning
LDA, the proposed BP is competitive in both speed and accuracy, as validated by encouraging experimental results on four
large-scale document data sets. Furthermore, the BP algorithm has the potential to become a generic learning scheme for variants
of LDA-based topic models. To this end, we show how to learn two typical variants of LDA-based topic models, author-topic models
(ATM) and relational topic models (RTM), using BP based on the factor graph representation.

Index Terms: Latent Dirichlet allocation, topic models, belief propagation, message passing, factor graph, Bayesian networks, Markov
random fields, hierarchical Bayesian models, Gibbs sampling, variational Bayes.

1 Introduction

Latent Dirichlet allocation (LDA) [1] is a three-layer hierarchical Bayesian
model (HBM) that can infer probabilistic word clusters called topics from the
document-word (document-term) matrix. LDA has no exact inference methods
because of loops in its graphical representation. Variational Bayes (VB) [1]
and collapsed Gibbs sampling (GS) [2] have been two commonly used approximate
inference methods for learning LDA and its extensions, including author-topic
models (ATM) [3] and relational topic models (RTM) [4]. Other inference
methods for probabilistic topic modeling include expectation-propagation
(EP) [5] and collapsed VB inference (CVB) [6]. The connections and empirical
comparisons among these approximate inference methods can be found in [7].
Recently, LDA and HBMs have found many important real-world applications in
text mining and computer vision (e.g., tracking historical topics from
time-stamped documents [8] and activity perception in crowded and
complicated scenes [9]).

This paper represents LDA by the factor graph [10] within the Markov random
field (MRF) framework [11]. From the MRF perspective, the topic modeling
problem can be interpreted as a labeling problem, in which the objective is
to assign a set of semantic topic labels to explain the observed nonzero
elements in the document-word matrix. MRF solves the labeling problem, which
arises widely in image analysis and computer vision, by two important
concepts: neighborhood systems and clique potentials [12] or factor
functions [11]. It assigns the best topic labels according to the maximum a
posteriori (MAP) estimation through maximizing the posterior probability,
which is by nature a prohibitive combinatorial optimization problem in the
discrete topic space. However, we often employ the smoothness prior [13] over
neighboring topic labels to reduce the complexity by encouraging or
penalizing only a limited number of possible labeling configurations.

The factor graph is a graphical representation method 
for both directed models (e.g., hidden Markov models 
(HMMs) [11, Chapter 13.2.3]) and undirected models 
(e.g., Markov random fields (MRFs) [11, Chapter 8.4.3]) 
because factor functions can represent both conditional 
and joint probabilities. In this paper, the proposed factor 
graph for LDA describes the same joint probability as 
that in the three-layer HBM, and thus it is not a new 
topic model but interprets LDA from a novel MRF 
perspective. The basic idea is inspired by the collapsed 
GS algorithm for LDA [2], [14], which integrates out 
multinomial parameters based on Dirichlet-Multinomial 
conjugacy and views Dirichlet hyperparameters as the 
pseudo topic labels having the same layer with the latent 
topic labels. In the collapsed hidden variable space, 
the joint probability of LDA can be represented as the 
product of factor functions in the factor graph. By contrast,
the undirected model "harmonium" [15] encodes 
a different joint probability from LDA and probabilistic 
latent semantic analysis (PLSA) [16], so that it is a new 
and viable alternative to the directed models. 

The factor graph representation facilitates the classic loopy belief
propagation (BP) algorithm [10], [11], [17] for approximate inference and
parameter estimation. By designing a proper neighborhood system and factor
functions, we may encourage or penalize different local labeling
configurations in the neighborhood system to realize the topic modeling
goal. The BP algorithm operates well on the factor graph, and it has the
potential to become a generic learning scheme for variants of LDA-based
topic models. For example, we also extend the BP algorithm to learn ATM [3]
and RTM [4] based on the factor graph representations. Although the
convergence of BP is not guaranteed on general graphs [11], it often
converges and works well in real-world applications.

The factor graph of LDA also reveals some intrinsic relations between HBM
and MRF. HBM is a class of directed models within the Bayesian network
framework [14], which represents the causal or conditional dependencies of
observed and hidden variables in a hierarchical manner, so that it is
difficult to factorize the joint probability of hidden variables. By
contrast, MRF can factorize the joint distribution of hidden variables into
the product of factor functions according to the Hammersley-Clifford
theorem [18], which facilitates the efficient BP algorithm for approximate
inference and parameter estimation. Although learning HBM often has
difficulty in estimating parameters and inferring hidden variables due to
the causal coupling effects, the alternative factor graph representation as
well as the BP-based learning scheme may shed more light on faster and more
accurate algorithms for HBM.

The remainder of this paper is organized as follows. In Section 2, we
introduce the factor graph interpretation for LDA, and derive the loopy BP
algorithm for approximate inference and parameter estimation. Moreover, we
discuss the intrinsic relations between BP and other state-of-the-art
approximate inference algorithms. Sections 3 and 4 present how to learn ATM
and RTM using the BP algorithms. Section 5 validates the BP algorithm on four
document data sets. Finally, Section 6 draws conclusions and envisions
future work.

2 Belief Propagation for LDA 

The probabilistic topic modeling task is to assign a set of semantic topic
labels, $z = \{z^k_{w,d}\}$, to explain the observed nonzero elements in the
document-word matrix $x = \{x_{w,d}\}$. Here $1 \le k \le K$ is the topic
index, $x_{w,d}$ is the number of word counts at the index $\{w, d\}$, and
$1 \le w \le W$ and $1 \le d \le D$ are the word index in the vocabulary and
the document index in the corpus. Table 1 summarizes some important
notations.

Fig. 1 shows the original three-layer graphical representation of LDA [1].
The document-specific topic proportion $\theta_d(k)$ generates a topic label
$z^k_{w,d,i} \in \{0, 1\}$, $\sum_{k=1}^{K} z^k_{w,d,i} = 1$, which in turn
generates each observed word token $i$ at the index $w$ in the document $d$
based on the topic-specific multinomial distribution $\phi_k(w)$ over the
vocabulary words. Both multinomial parameters $\theta_d(k)$ and $\phi_k(w)$
are generated by two Dirichlet distributions with hyperparameters $\alpha$
and $\beta$, respectively. For simplicity, we consider only the smoothed
LDA [2] with the fixed symmetric Dirichlet hyperparameters $\alpha$ and
$\beta$. The plates indicate replications. For example, the document $d$
repeats $D$ times in the corpus, the word token $w_n$ repeats $N_d$ times in
the document $d$, the vocabulary size is $W$, and there are $K$ topics.




Fig. 1. Three-layer graphical representation of LDA [1].

TABLE 1
Notations

  $1 \le d \le D$            Document index
  $1 \le w \le W$            Word index in vocabulary
  $1 \le k \le K$            Topic index
  $1 \le a \le A$            Author index
  $1 \le c \le C$            Link index
  $x = \{x_{w,d}\}$          Document-word matrix
  $z = \{z^k_{w,d}\}$        Topic labels for words
  $z_{-w,d}$                 Labels for d excluding w
  $z_{w,-d}$                 Labels for w excluding d
  $a_d$                      Coauthors of the document d
  $\mu_{\cdot,d}(k)$         $\sum_w x_{w,d} \mu_{w,d}(k)$
  $\mu_{w,\cdot}(k)$         $\sum_d x_{w,d} \mu_{w,d}(k)$
  $\theta_d$                 Factor of the document d
  $\phi_w$                   Factor of the word w
  $\eta_c$                   Factor of the link c
  $f(\cdot)$                 Factor functions
  $\alpha, \beta$            Dirichlet hyperparameters
2.1 Factor Graph Representation 

We begin by transforming the directed graph of Fig. 1 into a two-layer
factor graph, of which a representative fragment is shown in Fig. 2. The
notation, $z^k_{w,d} = \sum_{i=1}^{x_{w,d}} z^k_{w,d,i} / x_{w,d}$, denotes
the average topic labeling configuration over all word tokens
$1 \le i \le x_{w,d}$ at the index $\{w, d\}$. We define the neighborhood
system of the topic label $z_{w,d}$ as $z_{-w,d}$ and $z_{w,-d}$, where
$z_{-w,d}$ denotes a set of topic labels associated with all word indices in
the document $d$ except $w$, and $z_{w,-d}$ denotes a set of topic labels
associated with the word index $w$ in all documents except $d$. The factors
$\theta_d$ and $\phi_w$ are denoted by squares, and their connected
variables $z_{w,d}$ are denoted by circles. The factor $\theta_d$ connects
the neighboring topic labels $\{z_{w,d}, z_{-w,d}\}$ at different word
indices within the same document $d$, while the factor $\phi_w$ connects the
neighboring topic labels $\{z_{w,d}, z_{w,-d}\}$ at the same word index $w$
but in different documents. We absorb the observed word $w$ as the index of
$\phi_w$, which is similar to absorbing the observed document $d$ as the
index of $\theta_d$. Because the factors can be parameterized
functions [11], both $\theta_d$ and $\phi_w$ can represent the same
multinomial parameters as in Fig. 1.

Fig. 2 describes the same joint probability as Fig. 1 if we properly design
the factor functions. The bipartite factor graph is inspired by the
collapsed GS [2], [14] algorithm, which integrates out parameter variables
$\{\theta, \phi\}$ in Fig. 1 and treats the hyperparameters
$\{\alpha, \beta\}$ as pseudo topic counts in the same layer as the hidden
variables $z$. Thus, the joint probability of the collapsed hidden variables
can be factorized as the product of factor functions. This collapsed view
has also been discussed within the mean-field framework [19], inspiring the
zero-order approximation CVB (CVB0) algorithm [7] for LDA. So, we speculate
that all three-layer LDA-based topic models can be collapsed into the
two-layer factor graph, which facilitates the BP algorithm for efficient
inference and parameter estimation. However, how to use the two-layer factor
graph to represent more general multi-layer HBM still remains to be studied.

Fig. 2. Factor graph of LDA and message passing.

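Based on Dirichlet-Multinomial conjugacy, integrating out multinomial
parameters $\{\theta, \phi\}$ yields the joint probability [14] of LDA in
Fig. 2,

$$P(x, z | \alpha, \beta) \propto \prod_{d=1}^{D} \prod_{k=1}^{K} \frac{\Gamma(\sum_{w=1}^{W} x_{w,d} z^k_{w,d} + \alpha)}{\Gamma[\sum_{k=1}^{K} (\sum_{w=1}^{W} x_{w,d} z^k_{w,d} + \alpha)]} \times \prod_{w=1}^{W} \prod_{k=1}^{K} \frac{\Gamma(\sum_{d=1}^{D} x_{w,d} z^k_{w,d} + \beta)}{\Gamma[\sum_{w=1}^{W} (\sum_{d=1}^{D} x_{w,d} z^k_{w,d} + \beta)]}, \qquad (1)$$

where $x_{w,d} z^k_{w,d} = \sum_{i=1}^{x_{w,d}} z^k_{w,d,i}$ recovers the
original topic configuration over the word tokens in Fig. 1. Here, we design
the factor functions as

$$f_{\theta_d}(x_{\cdot,d}, z_{\cdot,d}, \alpha) = \prod_{k=1}^{K} \frac{\Gamma(\sum_{w=1}^{W} x_{w,d} z^k_{w,d} + \alpha)}{\Gamma[\sum_{k=1}^{K} (\sum_{w=1}^{W} x_{w,d} z^k_{w,d} + \alpha)]}, \qquad (2)$$

$$f_{\phi_w}(x_{w,\cdot}, z_{w,\cdot}, \beta) = \prod_{k=1}^{K} \frac{\Gamma(\sum_{d=1}^{D} x_{w,d} z^k_{w,d} + \beta)}{\Gamma[\sum_{w=1}^{W} (\sum_{d=1}^{D} x_{w,d} z^k_{w,d} + \beta)]}, \qquad (3)$$

where $z_{\cdot,d} = \{z_{w,d}, z_{-w,d}\}$ and
$z_{w,\cdot} = \{z_{w,d}, z_{w,-d}\}$ denote subsets of the variables in
Fig. 2. Therefore, the joint probability (1) of LDA can be re-written as the
product of factor functions [11, Eq. (8.59)] in Fig. 2,

$$P(x, z | \alpha, \beta) \propto \prod_{d=1}^{D} f_{\theta_d}(x_{\cdot,d}, z_{\cdot,d}, \alpha) \prod_{w=1}^{W} f_{\phi_w}(x_{w,\cdot}, z_{w,\cdot}, \beta). \qquad (4)$$

Therefore, the two-layer factor graph in Fig. 2 encodes exactly the same
information as the three-layer graph for LDA in Fig. 1. In this way, we may
interpret LDA within the MRF framework to treat probabilistic topic modeling
as a labeling problem.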



2.2 Belief Propagation (BP) 

The BP [11] algorithms provide exact solutions for inference problems in
tree-structured factor graphs but approximate solutions in factor graphs
with loops. Rather than directly computing the conditional joint probability
$p(z|x)$, we compute the conditional marginal probability,
$p(z^k_{w,d} = 1, x_{w,d} | z^k_{-w,-d}, x_{-w,-d})$, referred to as the
message $\mu_{w,d}(k)$, which can be normalized using a local computation,
i.e., $\sum_{k=1}^{K} \mu_{w,d}(k) = 1$, $0 \le \mu_{w,d}(k) \le 1$.
According to the Markov property in Fig. 2, we obtain

$$p(z^k_{w,d}, x_{w,d} | z^k_{-w,-d}, x_{-w,-d}) \propto p(z^k_{w,d}, x_{w,d} | z^k_{-w,d}, x_{-w,d}) \, p(z^k_{w,d}, x_{w,d} | z^k_{w,-d}, x_{w,-d}), \qquad (5)$$

where $-w$ and $-d$ denote all word indices except $w$ and all document
indices except $d$, and the notations $z_{-w,d}$ and $z_{w,-d}$ represent
all possible neighboring topic configurations. From the message passing
view, $p(z^k_{w,d}, x_{w,d} | z^k_{-w,d}, x_{-w,d})$ is the neighboring
message $\mu_{\theta_d \to z_{w,d}}(k)$ sent from the factor node
$\theta_d$, and $p(z^k_{w,d}, x_{w,d} | z^k_{w,-d}, x_{w,-d})$ is the other
neighboring message $\mu_{\phi_w \to z_{w,d}}(k)$ sent from the factor node
$\phi_w$. Notice that (5) uses the smoothness prior in MRF, which encourages
only $K$ smooth topic configurations within the neighborhood system. Using
Bayes' rule and the joint probability (1), we can expand Eq. (5) as

$$p(z^k_{w,d} = 1, x_{w,d} | z^k_{-w,-d}, x_{-w,-d}) \propto \frac{\sum_{-w} x_{-w,d} z^k_{-w,d} + \alpha}{\sum_{k=1}^{K} (\sum_{-w} x_{-w,d} z^k_{-w,d} + \alpha)} \times \frac{\sum_{-d} x_{w,-d} z^k_{w,-d} + \beta}{\sum_{w=1}^{W} (\sum_{-d} x_{w,-d} z^k_{w,-d} + \beta)}, \qquad (6)$$

where the property, $\Gamma(x + 1) = x \Gamma(x)$, is used to cancel the
common terms in both numerator and denominator [14]. We find that Eq. (6)
updates the message on the variable $z^k_{w,d}$ if its neighboring topic
configuration $\{z^k_{-w,d}, z^k_{w,-d}\}$ is known. However, due to
uncertainty, we know only the neighboring messages rather than the precise
topic configuration. So, we replace topic configurations by corresponding
messages in Eq. (6) and obtain the following message update equation,

$$\mu_{w,d}(k) \propto \frac{\mu_{-w,d}(k) + \alpha}{\sum_{k} [\mu_{-w,d}(k) + \alpha]} \times \frac{\mu_{w,-d}(k) + \beta}{\sum_{w} [\mu_{w,-d}(k) + \beta]}, \qquad (7)$$

where the aggregated neighboring messages on the right-hand side are

$$\mu_{-w,d}(k) = \sum_{-w} x_{-w,d} \, \mu_{-w,d}(k), \qquad (8)$$

$$\mu_{w,-d}(k) = \sum_{-d} x_{w,-d} \, \mu_{w,-d}(k). \qquad (9)$$

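Messages are passed from variables to factors, and in turn from factors to
variables until convergence or the maximum number of iterations is reached.
Notice that we need only pass messages for $x_{w,d} \ne 0$. Because $x$ is a
very sparse matrix, the message update equation (7) is computationally fast
by sweeping all nonzero elements in the sparse matrix $x$.

input  : $x$, $K$, $T$, $\alpha$, $\beta$.
output : $\theta_d$, $\phi_w$.
$\mu^1_{w,d}(k) \leftarrow$ initialization and normalization;
for $t \leftarrow 1$ to $T$ do
    $\mu^{t+1}_{w,d}(k) \propto \frac{\mu^t_{-w,d}(k) + \alpha}{\sum_k [\mu^t_{-w,d}(k) + \alpha]} \times \frac{\mu^t_{w,-d}(k) + \beta}{\sum_w [\mu^t_{w,-d}(k) + \beta]}$;
end
$\theta_d(k) \leftarrow [\mu_{\cdot,d}(k) + \alpha] / \sum_k [\mu_{\cdot,d}(k) + \alpha]$;
$\phi_w(k) \leftarrow [\mu_{w,\cdot}(k) + \beta] / \sum_w [\mu_{w,\cdot}(k) + \beta]$;

Fig. 3. The synchronous BP for LDA.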
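As an illustration only, the sweep over nonzero elements described by Eqs. (7) to (9) might be sketched as follows. The corpus layout (a list of zero-based `(w, d, x)` triples), the function name `sync_bp_sweep`, the toy data and the hyperparameter values are all assumptions of this sketch and do not come from the original implementation. The denominator over the vocabulary is read literally from Eq. (7), which is also an interpretation.

```python
import random

def sync_bp_sweep(triples, mu, W, D, K, alpha, beta):
    """One synchronous sweep of Eq. (7) over the nonzero entries (w, d, x)."""
    # Weighted message totals computed from the previous iteration only.
    doc = [[0.0] * K for _ in range(D)]    # sum_w x_{w,d} mu_{w,d}(k)
    word = [[0.0] * K for _ in range(W)]   # sum_d x_{w,d} mu_{w,d}(k)
    for (w, d, x), m in zip(triples, mu):
        for k in range(K):
            doc[d][k] += x * m[k]
            word[w][k] += x * m[k]
    topic = [sum(row[k] for row in word) for k in range(K)]

    new_mu = []
    for (w, d, x), m in zip(triples, mu):
        # Eqs. (8)-(9): exclude the current entry (w, d) from both totals.
        a = [doc[d][k] - x * m[k] + alpha for k in range(K)]
        b = [word[w][k] - x * m[k] + beta for k in range(K)]
        # sum_w [mu_{w,-d}(k) + beta]: read literally, every contribution of
        # document d is excluded (an interpretation made for this sketch).
        b_den = [topic[k] - doc[d][k] + W * beta for k in range(K)]
        a_den = sum(a)
        u = [a[k] / a_den * b[k] / b_den[k] for k in range(K)]
        s = sum(u)
        new_mu.append([v / s for v in u])   # local normalization over k
    return new_mu

# Hypothetical toy corpus: (word index, document index, count), zero-based.
triples = [(0, 0, 2), (1, 0, 1), (1, 1, 3), (2, 1, 1)]
W, D, K = 3, 2, 2
rng = random.Random(0)
mu = []
for _ in triples:
    r = [rng.random() + 1e-6 for _ in range(K)]
    mu.append([v / sum(r) for v in r])
for _ in range(20):
    mu = sync_bp_sweep(triples, mu, W, D, K, alpha=0.1, beta=0.01)
```

Only the nonzero entries are visited, so the cost of one sweep grows with the number of nonzero elements rather than with W times D.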

Based on messages, we can estimate the multinomial parameters $\theta$ and
$\phi$ by the expectation-maximization (EM) algorithm [14]. The E-step
infers the normalized message $\mu_{w,d}(k)$. Using the
Dirichlet-Multinomial conjugacy and Bayes' rule, we express the marginal
Dirichlet distributions on parameters as follows,

$$p(\theta_d) = \mathrm{Dir}(\theta_d | \mu_{\cdot,d}(k) + \alpha), \qquad (10)$$

$$p(\phi_w) = \mathrm{Dir}(\phi_w | \mu_{w,\cdot}(k) + \beta). \qquad (11)$$

The M-step maximizes (10) and (11) with respect to $\theta_d$ and $\phi_w$,
resulting in the following point estimates of multinomial parameters,

$$\theta_d(k) = \frac{\mu_{\cdot,d}(k) + \alpha}{\sum_k [\mu_{\cdot,d}(k) + \alpha]}, \qquad (12)$$

$$\phi_w(k) = \frac{\mu_{w,\cdot}(k) + \beta}{\sum_w [\mu_{w,\cdot}(k) + \beta]}. \qquad (13)$$

In this paper, we consider only fixed hyperparameters $\{\alpha, \beta\}$.
Interested readers can figure out how to estimate hyperparameters based on
inferred messages in [14].
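The point estimates (12) and (13) amount to accumulating count-weighted messages and normalizing. A minimal sketch is given below, assuming a hypothetical list of zero-based `(w, d, x)` triples with one normalized K-tuple message per triple; the function name and the toy values are illustrative and not taken from the paper.

```python
def estimate_parameters(triples, mu, W, D, K, alpha, beta):
    """Point estimates of Eqs. (12)-(13) from normalized messages."""
    doc = [[alpha] * K for _ in range(D)]    # mu_{.,d}(k) + alpha
    word = [[beta] * K for _ in range(W)]    # mu_{w,.}(k) + beta
    for (w, d, x), m in zip(triples, mu):
        for k in range(K):
            doc[d][k] += x * m[k]
            word[w][k] += x * m[k]
    # Eq. (12): normalize over topics k for each document d.
    theta = [[v / sum(row) for v in row] for row in doc]
    # Eq. (13): normalize over vocabulary words w for each topic k.
    col = [sum(word[w][k] for w in range(W)) for k in range(K)]
    phi = [[word[w][k] / col[k] for k in range(K)] for w in range(W)]
    return theta, phi

# Hypothetical toy input: three nonzero entries, two topics.
triples = [(0, 0, 2), (1, 0, 1), (1, 1, 3)]
mu = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
theta, phi = estimate_parameters(triples, mu, W=2, D=2, K=2, alpha=1.0, beta=1.0)
```

Here `theta[d][k]` sums to one over k, while `phi[w][k]` sums to one over w for each fixed k.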

To implement the BP algorithm, we must choose either the synchronous or the
asynchronous update schedule to pass messages [20]. Fig. 3 shows the
synchronous message update schedule. At each iteration $t$, each variable
uses the incoming messages from the previous iteration $t - 1$ to calculate
the current message. Once every variable has computed its message, the
message is passed to the neighboring variables and used to compute messages
in the next iteration $t + 1$. An alternative is the asynchronous message
update schedule. It updates the message of each variable immediately, and
the updated message is immediately used to compute other neighboring
messages within the same iteration $t$. The asynchronous update schedule
often passes messages faster across variables, which causes the BP algorithm
to converge faster than under the synchronous update schedule. Another
termination condition for convergence is that the change of the multinomial
parameters [1] is less than a predefined threshold $\lambda$, for example,
$\lambda = 0.00001$ [21].
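To make the difference between the two schedules concrete, the following sketch updates each message in place, so that later updates within the same sweep already see it. It is a hedged illustration: the function name `async_bp`, the triple layout, the toy corpus, and the choice to monitor the largest message change (rather than the change of the multinomial parameters) as the stopping criterion are assumptions of this sketch.

```python
import random

def async_bp(triples, W, D, K, alpha, beta, T=200, tol=1e-5, seed=0):
    """Asynchronous BP: each updated message is used immediately."""
    rng = random.Random(seed)
    mu = []
    for _ in triples:
        r = [rng.random() + 1e-6 for _ in range(K)]
        mu.append([v / sum(r) for v in r])
    doc = [[0.0] * K for _ in range(D)]
    word = [[0.0] * K for _ in range(W)]
    topic = [0.0] * K
    for (w, d, x), m in zip(triples, mu):
        for k in range(K):
            doc[d][k] += x * m[k]
            word[w][k] += x * m[k]
            topic[k] += x * m[k]
    sweeps = 0
    for sweeps in range(1, T + 1):
        change = 0.0
        for i, (w, d, x) in enumerate(triples):
            old = mu[i]
            u = []
            for k in range(K):
                a = doc[d][k] - x * old[k] + alpha        # mu_{-w,d}(k) + alpha
                b = word[w][k] - x * old[k] + beta        # mu_{w,-d}(k) + beta
                b_den = topic[k] - doc[d][k] + W * beta   # sum_w [mu_{w,-d}(k) + beta]
                u.append(a * b / b_den)   # sum_k [...] is constant in k, so omitted
            s = sum(u)
            new = [v / s for v in u]
            # Push the new message into the running totals right away.
            for k in range(K):
                delta = x * (new[k] - old[k])
                doc[d][k] += delta
                word[w][k] += delta
                topic[k] += delta
                change = max(change, abs(new[k] - old[k]))
            mu[i] = new
        if change < tol:   # assumed proxy for the parameter-change criterion
            break
    return mu, sweeps

# Hypothetical toy corpus: (word index, document index, count), zero-based.
triples = [(0, 0, 2), (1, 0, 1), (1, 1, 3), (2, 1, 1)]
mu, sweeps = async_bp(triples, W=3, D=2, K=2, alpha=0.1, beta=0.01)
```

A synchronous variant would instead freeze the totals at the start of each sweep and rebuild them only after every message has been recomputed.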



2.3 An Alternative View of BP 

We may also adopt one of the BP instantiations, the sum-product
algorithm [11], to infer $\mu_{w,d}(k)$. For convenience, we will not
include the observation $x_{w,d}$ in the formulation. Fig. 2 shows the
message passing from two factors $\theta_d$ and $\phi_w$ to the variable
$z_{w,d}$, where the arrows denote the message passing directions. Based on
the smoothness prior, we encourage only $K$ smooth topic configurations
without considering all other possible configurations. The message
$\mu_{w,d}(k)$ is proportional to the product of both incoming messages from
factors,

$$\mu_{w,d}(k) \propto \mu_{\theta_d \to z_{w,d}}(k) \times \mu_{\phi_w \to z_{w,d}}(k). \qquad (14)$$

Eq. (14) has the same meaning as (5). The messages from factors to variables
are the sum of all incoming messages from the neighboring variables,

$$\mu_{\theta_d \to z_{w,d}}(k) = f_{\theta_d} \prod_{-w} \mu_{-w,d}(k) \, \alpha, \qquad (15)$$

$$\mu_{\phi_w \to z_{w,d}}(k) = f_{\phi_w} \prod_{-d} \mu_{w,-d}(k) \, \beta, \qquad (16)$$

where $\alpha$ and $\beta$ can be viewed as the pseudo-messages, and
$f_{\theta_d}$ and $f_{\phi_w}$ are the factor functions that encourage or
penalize the incoming messages.

In practice, however, Eqs. (15) and (16) often cause the product of multiple
incoming messages to be close to zero [12]. To avoid arithmetic underflow,
we use the sum operation rather than the product operation of incoming
messages because when the product value increases the sum value also
increases,

$$\prod_{-w} \mu_{-w,d}(k) \, \alpha \propto \sum_{-w} \mu_{-w,d}(k) + \alpha, \qquad (17)$$

$$\prod_{-d} \mu_{w,-d}(k) \, \beta \propto \sum_{-d} \mu_{w,-d}(k) + \beta. \qquad (18)$$

Such approximations as (17) and (18) transform the sum-product to the
sum-sum algorithm, which resembles the relaxation labeling algorithm for
learning MRF with good performance [12].

The normalized message $\mu_{w,d}(k)$ is multiplied by the number of word
counts $x_{w,d}$ during the propagation, i.e., $x_{w,d} \mu_{w,d}(k)$. In
this sense, $x_{w,d}$ can be viewed as the weight of $\mu_{w,d}(k)$ during
the propagation, where a bigger $x_{w,d}$ corresponds to a larger influence
of its message on those of its neighbors. Thus, the topics may be distorted
by those documents with greater word counts. To avoid this problem, we may
choose another weight like term frequency (TF) or term frequency-inverse
document frequency (TF-IDF) for weighted belief propagation. In this sense,
BP can not only handle discrete data, but also process continuous data like
TF-IDF. The MRF model in Fig. 2 can be extended to describe both discrete
and continuous data in general, while LDA in Fig. 1 focuses only on
generating discrete data.
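The underflow that motivates (17) and (18) is easy to reproduce in double precision. The message values below are hypothetical and chosen only to make the effect visible; the paper does not report such numbers.

```python
# 80 small incoming messages (hypothetical values in [0, 1]).
messages = [1e-5] * 80
alpha = 0.1

# Product form of Eq. (15): the running product leaves the float64 range.
product = alpha
for m in messages:
    product *= m

# Sum form of Eq. (17): the value stays well scaled.
total = sum(messages) + alpha
```

With these values the product collapses to exactly `0.0`, so every topic would receive the same zero score, while the sum remains close to `0.1008` and still ranks topics by their accumulated support.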

In the MRF model, we can design the factor functions arbitrarily to
encourage or penalize local topic labeling configurations based on our prior
knowledge. From Fig. 1, LDA solves the topic modeling problem according to
three intrinsic assumptions:

1) Co-occurrence: Different word indices within the
same document tend to be associated with the same
topic labels.

2) Smoothness: The same word indices in different
documents are likely to be associated with the same
topic labels.

3) Clustering: Not all word indices tend to be
associated with the same topic labels.

The first assumption is determined by the document-specific topic proportion
$\theta_d(k)$, where it is more likely to assign a topic label
$z^k_{w,d} = 1$ to the word index $w$ if the topic $k$ is more frequently
assigned to other word indices in the document $d$. Similarly, the second
assumption is based on the topic-specific multinomial distribution
$\phi_k(w)$, which implies a higher likelihood to associate the word index
$w$ with the topic label $z^k_{w,d} = 1$ if $k$ is more frequently assigned
to the same word index $w$ in other documents except $d$. The third
assumption avoids grouping all word indices into one topic through
normalizing $\phi_k(w)$ in terms of all word indices. If most word indices
are associated with the topic $k$, the multinomial parameter $\phi_k$ will
become too small to allocate the topic $k$ to these word indices.

According to the above assumptions, we design $f_{\theta_d}$ and
$f_{\phi_w}$ over messages as

$$f_{\theta_d} = \frac{1}{\sum_k [\mu_{-w,d}(k) + \alpha]}, \qquad (19)$$

$$f_{\phi_w} = \frac{1}{\sum_w [\mu_{w,-d}(k) + \beta]}. \qquad (20)$$

Eq. (19) normalizes the incoming messages by the total number of messages
for all topics associated with the document $d$ to make outgoing messages
comparable across documents. Eq. (20) normalizes the incoming messages by
the total number of messages for all words in the vocabulary to make
outgoing messages comparable across vocabulary words. Notice that (15) and
(16) realize the first two assumptions, and (20) encodes the third
assumption of topic modeling. A similar normalization technique to avoid
partitioning all data points into one cluster has been used in the classic
normalized cuts algorithm for image segmentation [22]. Combining (14) to
(20) will yield the same message update equation (7). To estimate parameters
$\theta_d$ and $\phi_w$, we use the joint marginal distributions (15) and
(16) of the set of variables belonging to factors $\theta_d$ and $\phi_w$,
including the variable $z_{w,d}$, which produce the same point estimation
equations (12) and (13).

2.4 Simplified BP (siBP) 

We may simplify the message update equation (7). Substituting (12) and (13)
into (7) yields the approximate message update equation,

$$\mu_{w,d}(k) \propto \theta_d(k) \times \phi_w(k), \qquad (21)$$

which includes the current message $\mu_{w,d}(k)$ in both numerator and
denominator in (7). In many real-world topic modeling tasks, a document
often contains many different word indices, and the same word index appears
in many different documents. So, at each iteration, Eq. (21) deviates
slightly from (7) after adding the current message to both numerator and
denominator. Such a slight difference may be enlarged after many iterations
in Fig. 3 due to accumulation effects, leading to different estimated
parameters. Intuitively, Eq. (21) implies that if the topic $k$ has a higher
proportion in the document $d$, and it has a higher likelihood to generate
the word index $w$, it is more likely to allocate the topic $k$ to the
observed word $x_{w,d}$. This allocation scheme in principle follows the
three intrinsic topic modeling assumptions in Subsection 2.3. Fig. 4 shows
the MATLAB code for the simplified BP (siBP).

function [phi, theta] = siBP(X, K, T, ALPHA, BETA)
% X is a W*D sparse matrix of word counts.
% W is the vocabulary size.
% D is the number of documents.
% 'xi' holds the nonzero word counts of X.
% 'wi' and 'di' are the corresponding word and document indices.
% K is the number of topics.
% T is the number of iterations.
% mu is a matrix with K rows for topic messages.
% phi is a K*W matrix.
% theta is a K*D matrix.
% ALPHA and BETA are hyperparameters.
% normalize(A, dim) is a helper that returns the normalized values
% (sum to one) of the elements along dimension dim of an array; it is
% not MATLAB's built-in normalize, which rescales data differently.
[W, D] = size(X);
[wi, di, xi] = find(X);
% random initialization
mu = normalize(rand(K, nnz(X)), 1);
md = zeros(K, D);
mw = zeros(K, W);
% simplified belief propagation
for t = 1:T
    for k = 1:K
        % sum the weighted messages per document and per word
        md(k, :) = accumarray(di, xi .* mu(k, :)', [D 1])';
        mw(k, :) = accumarray(wi, xi .* mu(k, :)', [W 1])';
    end
    theta = normalize(md + ALPHA, 1);                 % Eq. (12)
    phi = normalize(mw + BETA, 2);                    % Eq. (13)
    mu = normalize(theta(:, di) .* phi(:, wi), 1);    % Eq. (21)
end
return

Fig. 4. The MATLAB code for siBP.
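For readers without MATLAB, the same three steps, Eqs. (12), (13) and (21), might be sketched with plain Python lists as follows. This is an assumed port for illustration, not the authors' code: the `(w, d, x)` triple layout, the function name `sibp`, the `[w][k]` orientation of `phi` and the toy corpus are choices made here.

```python
import random

def sibp(triples, W, D, K, T, alpha, beta, seed=0):
    """Simplified BP: alternate Eqs. (12), (13) and (21) for T iterations."""
    rng = random.Random(seed)
    mu = []
    for _ in triples:
        r = [rng.random() + 1e-12 for _ in range(K)]
        mu.append([v / sum(r) for v in r])
    theta = phi = None
    for _ in range(T):
        md = [[alpha] * K for _ in range(D)]   # mu_{.,d}(k) + alpha
        mw = [[beta] * K for _ in range(W)]    # mu_{w,.}(k) + beta
        for (w, d, x), m in zip(triples, mu):
            for k in range(K):
                md[d][k] += x * m[k]
                mw[w][k] += x * m[k]
        theta = [[v / sum(row) for v in row] for row in md]              # Eq. (12)
        col = [sum(mw[w][k] for w in range(W)) for k in range(K)]
        phi = [[mw[w][k] / col[k] for k in range(K)] for w in range(W)]  # Eq. (13)
        mu = []
        for (w, d, x) in triples:
            u = [theta[d][k] * phi[w][k] for k in range(K)]              # Eq. (21)
            s = sum(u)
            mu.append([v / s for v in u])
    return theta, phi, mu

# Hypothetical toy corpus with two groups of documents and words.
triples = [(0, 0, 4), (1, 0, 3), (0, 1, 2), (1, 1, 5),
           (2, 2, 4), (3, 2, 3), (2, 3, 1), (3, 3, 6)]
theta, phi, mu = sibp(triples, W=4, D=4, K=2, T=50, alpha=0.1, beta=0.01)
```

Unlike Eq. (7), the current message is not subtracted from the totals before the update, which is exactly the simplification that distinguishes siBP.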

2.5 Relationship to Previous Algorithms 

Here we discuss some intrinsic relations between BP and three
state-of-the-art LDA learning algorithms, VB [1], GS [2], and zero-order
approximation CVB (CVB0) [7], [19], within the unified message passing
framework. The message update scheme is an instantiation of the E-step of
the EM algorithm [23], which has been widely used to infer the marginal
probabilities of hidden variables in various graphical models according to
the maximum-likelihood estimation [11] (e.g., the E-step inference for
GMMs [24], the forward-backward algorithm for HMMs [25], and the
probabilistic relaxation labeling algorithm for MRF [26]). After the E-step,
we estimate the optimal parameters using the updated messages and
observations at the M-step of the EM algorithm.

VB is a variational message passing method [27] that uses a set of
factorized variational distributions $q(z)$ to approximate the joint
distribution (4) by minimizing the Kullback-Leibler (KL) divergence between
them. Employing Jensen's inequality makes the approximate variational
distribution an adjustable lower bound on the joint distribution, so that
maximizing the joint probability is equivalent to maximizing the lower bound
by tuning a set of variational parameters. The lower bound $q(z)$ is also an
MRF in nature that approximates the joint distribution (4). Because there is
always a gap between the lower bound and the true joint distribution, VB
introduces bias when learning LDA. The variational message update equation
is

$$\mu_{w,d}(k) \propto \frac{\exp[\Psi(\mu_{\cdot,d}(k) + \alpha)]}{\exp[\Psi(\sum_k [\mu_{\cdot,d}(k) + \alpha])]} \times \frac{\mu_{w,\cdot}(k) + \beta}{\sum_w [\mu_{w,\cdot}(k) + \beta]}, \qquad (22)$$

which resembles the synchronous BP (7) but with two major differences.
First, VB uses complicated digamma functions $\Psi(\cdot)$, which not only
introduce bias [7] but also slow down the message updating. Second, VB uses
a different variational EM schedule. At the E-step, it simultaneously
updates both the variational messages and the parameter of $\theta_d$ until
convergence, holding the variational parameter of $\phi$ fixed. At the
M-step, VB updates only the variational parameter of $\phi$.

The message update equation of GS is

$$\mu_{w,d,i}(k) \propto \frac{n^{-i}_{\cdot,d}(k) + \alpha}{\sum_k [n^{-i}_{\cdot,d}(k) + \alpha]} \times \frac{n^{-i}_{w,\cdot}(k) + \beta}{\sum_w [n^{-i}_{w,\cdot}(k) + \beta]}, \qquad (23)$$

where $n^{-i}_{\cdot,d}(k)$ is the total number of topic labels $k$ in the
document $d$ except the topic label on the current word token $i$, and
$n^{-i}_{w,\cdot}(k)$ is the total number of topic labels $k$ of the word
$w$ except the topic label on the current word token $i$. Eq. (23) resembles
the asynchronous BP implementation (7) but with two subtle differences.
First, GS randomly samples the current topic label $z^k_{w,d,i} = 1$ from
the message $\mu_{w,d,i}(k)$, which truncates all $K$-tuple message values
to zeros except the sampled topic label $k$. Such information loss
introduces bias when learning LDA. Second, GS must sample a topic label for
each word token, which repeats $x_{w,d}$ times for the word index
$\{w, d\}$. The sweep over all word tokens rather than word indices
restricts GS's scalability to large-scale document repositories containing
billions of word tokens.
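The truncation performed by GS can be illustrated in a few lines. In this hedged sketch, `gs_draw` is a hypothetical helper and the message values are invented: sampling turns a soft K-tuple message into a one-hot label, whereas BP would pass the soft message on unchanged.

```python
import random

def gs_draw(message, rng):
    """Sample one topic from a K-tuple message and return a one-hot vector."""
    k = rng.choices(range(len(message)), weights=message, k=1)[0]
    return [1.0 if j == k else 0.0 for j in range(len(message))]

rng = random.Random(0)
message = [0.6, 0.3, 0.1]          # hypothetical normalized message for one token
one_hot = gs_draw(message, rng)    # GS keeps only the sampled topic
soft = list(message)               # BP keeps all K values
```

Whatever topic is drawn, the two unsampled entries are set to zero, which is the information loss referred to above.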

CVB0 is exactly equivalent to our asynchronous BP implementation but based
on word tokens. Previous empirical comparisons [7] advocated the CVB0
algorithm for LDA within the approximate mean-field framework [19], which is
closely connected with the proposed BP. Here we clearly explain that the
superior performance of CVB0 can be largely attributed to its asynchronous
BP implementation from the MRF perspective. Our experiments also support
that message passing over word indices instead of tokens will produce
comparable or even better topic modeling performance but with significantly
smaller computational costs.

Eq. (21) also reveals that siBP is a probabilistic matrix factorization
algorithm that factorizes the document-word matrix,
$x = [x_{w,d}]_{W \times D}$, into a matrix of document-specific topic
proportions, $\theta = [\theta_d(k)]_{K \times D}$, and a matrix of
vocabulary word-specific topic proportions,
$\phi = [\phi_w(k)]_{K \times W}$, i.e., $x \sim \phi^T \theta$. We see that
a larger number of word counts $x_{w,d}$ corresponds to a higher likelihood
$\sum_k \theta_d(k) \phi_w(k)$. From this point of view, the multinomial
principal component analysis (PCA) [28] describes some intrinsic relations
among LDA, PLSA [16], and non-negative matrix factorization (NMF) [29].
Eq. (21) is the same as the E-step update for PLSA except that the
parameters $\theta$ and $\phi$ are smoothed by the hyperparameters $\alpha$
and $\beta$ to prevent overfitting.

VB, BP and siBP have the computational complexity $O(K D W_d T)$, but GS and
CVB0 require $O(K D N_d T)$, where $W_d$ and $N_d$ are the average
vocabulary size and the average number of word tokens per document, and $T$
is the number of learning iterations.
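The gap between the two complexity classes comes from what is swept in each iteration: nonzero matrix entries versus individual tokens. A small count on a hypothetical corpus of `(w, d, x)` triples makes this concrete; the numbers are invented for illustration.

```python
# Hypothetical corpus: (word index, document index, count).
triples = [(0, 0, 5), (1, 0, 2), (1, 1, 7), (2, 1, 1)]
K = 10

nonzeros = len(triples)                    # D * W_d entries swept by VB, BP, siBP
tokens = sum(x for _, _, x in triples)     # D * N_d tokens swept by GS, CVB0

bp_updates_per_iteration = K * nonzeros
gs_updates_per_iteration = K * tokens
```

Here BP touches 4 entries per sweep while a token-based method touches 15, and the ratio grows with the average count per nonzero entry.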

3 Belief Propagation for ATM 

Author-topic models (ATM) [3] depict each author of the document as a
mixture of probabilistic topics, and have found important applications in
matching papers with reviewers [30]. Fig. 5A shows the generative graphical
representation for ATM, which first uses a document-specific uniform
distribution $u_d$ to generate an author index $a$, $1 \le a \le A$, and
then uses the author-specific topic proportions $\theta_a$ to generate a
topic label $z^k_{w,d} = 1$ for the word index $w$ in the document $d$. The
plate on $\theta$ indicates that there are $A$ unique authors in the corpus.
A document often has multiple coauthors. ATM randomly assigns one of the
observed author indices to each word in the document based on the
document-specific uniform distribution $u_d$. However, it is more reasonable
that each word $x_{w,d}$ is associated with an author index $a \in a_d$ from
the multinomial rather than uniform distribution, where $a_d$ is a set of
author indices of the document $d$. As a result, each topic label takes two
variables,

$$z^{a,k}_{w,d} \in \{0, 1\}, \quad \sum_{a,k} z^{a,k}_{w,d} = 1, \quad a \in a_d, \quad 1 \le k \le K,$$

where $a$ is the author index and $k$ is the topic index attached to the
word.

We transform Fig. 5A to the factor graph representation of ATM in Fig. 5B.
As with Fig. 2, we absorb the observed author index $a \in a_d$ of the
document $d$ as the index of the factor $\theta_{a \in a_d}$. The notation
$z^a_{-w,\cdot}$ denotes all labels connected with the authors $a \in a_d$
except those for the word index $w$. The only difference between ATM and LDA
is that the author $a \in a_d$ instead of the document $d$ connects the
labels $z^a_{w,d}$ and $z^a_{-w,\cdot}$. As a result, ATM encourages topic
smoothness among labels $z^a_{w,d}$ attached to the same author $a$ instead
of the same document $d$.

Fig. 5. (A) The three-layer graphical representation [3] and (B) two-layer factor graph of ATM.

input  : $x$, $a_d$, $K$, $T$, $\alpha$, $\beta$.
output : $\theta_a$, $\phi_w$.
$\mu^{a,1}_{w,d}(k), a \in a_d \leftarrow$ initialization and normalization;
for $t \leftarrow 1$ to $T$ do
    update $\mu^{a}_{w,d}(k)$, $a \in a_d$, by Eq. (26) using the messages of iteration $t$;
end
$\theta_a(k) \leftarrow [\mu^a_{\cdot,\cdot}(k) + \alpha] / \sum_k [\mu^a_{\cdot,\cdot}(k) + \alpha]$;
$\phi_w(k) \leftarrow [\mu_{w,\cdot}(k) + \beta] / \sum_w [\mu_{w,\cdot}(k) + \beta]$;

Fig. 6. The synchronous BP for ATM.



3.1 Inference and Parameter Estimation 

Unlike passing the /-C- tuple message fj,w,d{k) in Fig. |3l 
the BP algorithm for learning ATM passes the ja^l x K- 
tuple message vectors /i° ^(fc), a e through the factor 
Oaesid in Fig. |5^, where |arf| is the number of authors in 
the document d. Nevertheless, we can still obtain the K- 
tuple word topic message fJ.w,d{k) by marginalizing the 
message ^(fc) in terms of the author variable a e a^ 
as follows. 



lJ-w,d{k) 



(24) 



Since Figs. |2] and |5p have the same right half part, 
the message passing equation from the factor to the 
variable z^,,d and the parameter estimation equation for 
(pw in Fig. |5p remain the same as (0 and (|T3] l based on 
the marginalized word topic message in l|24l l. Thus, we 
only need to derive the message passing equation from 
the factor 6'aead to the variable zj^ ^ in Fig.|5p. Because of 
the topic smoothness prior, we design the factor function 
as follows, 

1 



- a 



(25) 



where $\mu^a_{-w,-d}(k) = \sum_{-w,-d} x^a_{w,d}\,\mu^a_{w,d}(k)$ denotes the sum of all incoming messages attached to the author index $a$ and the topic index $k$, excluding $x^a_{w,d}\,\mu^a_{w,d}(k)$. Likewise, Eq. (25) normalizes the incoming messages attached to the author index $a$ in terms of the topic index $k$ to make outgoing messages comparable for different authors $a \in \mathbf{a}_d$. Similar to (15), we derive the message $\mu_{\theta_a \to z^a_{w,d}}$ by adding all incoming messages evaluated by the factor function (25).

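Multiplying the two messages from factors $\theta_{a \in \mathbf{a}_d}$ and $\phi_w$ yields the message update equation as follows,

$$\mu^a_{w,d}(k) \propto \frac{\mu^a_{-w,-d}(k) + \alpha}{\sum_k [\mu^a_{-w,-d}(k) + \alpha]} \times \frac{\mu_{w,-d}(k) + \beta}{\sum_w [\mu_{w,-d}(k) + \beta]}. \qquad (26)$$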



Notice that the $|\mathbf{a}_d| \times K$-tuple message $\mu^a_{w,d}(k), a \in \mathbf{a}_d$, is normalized in terms of all combinations of $\{a, k\}$, $a \in \mathbf{a}_d$, $1 \le k \le K$. Based on the normalized messages, the author-specific topic proportion $\theta_a(k)$ can be estimated from the sum of all incoming messages including $\mu^a_{w,d}$ evaluated by the factor function $f_{\theta_a}$ as follows,



$$\theta_a(k) = \frac{\mu^a_{\cdot,\cdot}(k) + \alpha}{\sum_k [\mu^a_{\cdot,\cdot}(k) + \alpha]}. \qquad (27)$$

In summary, Fig. 6 shows the synchronous BP algorithm for learning ATM. The difference between Fig. 3 and Fig. 6 is that Fig. 6 additionally considers the author index $a$ as a label for each word. Over $T$ iterations, the computational complexity is $\mathcal{O}(K D W_d A_d T)$, where $A_d$ is the average number of authors per document.
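The update (26) and the estimate (27) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the released MEX implementation: the nested dictionary layout `mu[d][w][a][k]`, the helper names, the toy corpus and the deterministic initialization are all hypothetical, and the exclusions $-w,-d$ and $-d$ are implemented by subtracting only the current word-document pair's own contribution.

```python
def totals(docs, authors, mu, K):
    """Aggregate x * mu per author and per word from the current messages."""
    author_tot, word_tot = {}, {}
    for d, counts in enumerate(docs):
        for w, x in counts.items():
            wt = word_tot.setdefault(w, [0.0] * K)
            for a in authors[d]:
                at = author_tot.setdefault(a, [0.0] * K)
                for k in range(K):
                    at[k] += x * mu[d][w][a][k]
                    wt[k] += x * mu[d][w][a][k]  # marginalized over authors, Eq. (24)
    return author_tot, word_tot


def atm_sweep(docs, authors, mu, K, W, alpha, beta):
    """One synchronous update of every message mu[d][w][a][k], following Eq. (26)."""
    author_tot, word_tot = totals(docs, authors, mu, K)
    topic_tot = [sum(wt[k] for wt in word_tot.values()) for k in range(K)]
    new_mu = []
    for d, counts in enumerate(docs):
        new_doc = {}
        for w, x in counts.items():
            # Own contribution of this (w, d) pair to the word totals.
            own_w = [x * sum(mu[d][w][a][k] for a in authors[d]) for k in range(K)]
            scores = {}
            for a in authors[d]:
                own_a = [x * mu[d][w][a][k] for k in range(K)]
                a_norm = sum(author_tot[a]) - sum(own_a) + K * alpha
                scores[a] = [
                    (author_tot[a][k] - own_a[k] + alpha) / a_norm
                    * (word_tot[w][k] - own_w[k] + beta)
                    / (topic_tot[k] - own_w[k] + W * beta)
                    for k in range(K)
                ]
            z = sum(sum(s) for s in scores.values())  # normalize over all {a, k}
            new_doc[w] = {a: [v / z for v in s] for a, s in scores.items()}
        new_mu.append(new_doc)
    return new_mu


def estimate_theta(docs, authors, mu, K, alpha):
    """Author-specific topic proportions, Eq. (27)."""
    author_tot, _ = totals(docs, authors, mu, K)
    return {a: [(t[k] + alpha) / (sum(t) + K * alpha) for k in range(K)]
            for a, t in author_tot.items()}


# Toy corpus (hypothetical): word id -> count per document, author ids per document.
docs = [{0: 2, 1: 1}, {1: 3, 2: 1}]
authors = [[0, 1], [1]]
K, W, alpha, beta = 2, 3, 0.01, 0.01

# Deterministic, non-uniform initialization, normalized over {a, k} for each word.
mu = []
for d, counts in enumerate(docs):
    doc = {}
    for w in counts:
        raw = {a: [1.0 + ((d + w + a + k) % 3) for k in range(K)] for a in authors[d]}
        z = sum(sum(v) for v in raw.values())
        doc[w] = {a: [v / z for v in vals] for a, vals in raw.items()}
    mu.append(doc)

for _ in range(20):
    mu = atm_sweep(docs, authors, mu, K, W, alpha, beta)
theta = estimate_theta(docs, authors, mu, K, alpha)
```

The sweep reads only the totals computed from the previous messages, which is what makes the schedule synchronous; the inner loop over authors is the source of the extra factor $A_d$ in the complexity.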

4 Belief Propagation for RTM 

Network data, such as citation and coauthor networks of documents [30], [31], tag networks of documents and images [32], hyperlinked networks of web pages, and social networks of friends, exist pervasively in data mining and machine learning. The probabilistic relational topic modeling of network data can provide both useful predictive models and descriptive statistics [4].

In Fig. 7A, relational topic models (RTM) [4] represent entire document topics by the mean value of the document topic proportions, and use the Hadamard product of mean values $\bar{z}_d \circ \bar{z}_{d'}$ from two linked documents $\{d, d'\}$ as link features, which are learned by the generalized linear model (GLM) to generate the observed binary citation link variable $c = 1$. All other parts of RTM remain the same as LDA.

Fig. 7. (A) The three-layer graphical representation [4] and (B) two-layer factor graph of RTM.

input : $\mathbf{w}, \mathbf{c}, \xi, K, T, \alpha, \beta$.
output : $\theta_d, \phi_w$.
$\mu^1_{w,d}(k) \leftarrow$ initialization and normalization;
for $t \leftarrow 1$ to $T$ do
    $\mu^{t+1}_{w,d}(k) \propto [(1 - \xi)\,\mu^t_{\theta_d \to z_{w,d}}(k) + \xi\,\mu^t_{\eta_c \to z_{w,d}}(k)] \times \mu^t_{\phi_w \to z_{w,d}}(k)$;
end
$\theta_d(k) \leftarrow [\mu_{\cdot,d}(k) + \alpha] / \sum_k [\mu_{\cdot,d}(k) + \alpha]$;
$\phi_w(k) \leftarrow [\mu_{w,\cdot}(k) + \beta] / \sum_w [\mu_{w,\cdot}(k) + \beta]$;

Fig. 8. The synchronous BP for RTM.

We transform Fig. 7A to the factor graph Fig. 7B by absorbing the observed link index $c \in \mathbf{c}, 1 \le c \le C$, as the index of the factor $\eta_c$. Each link index connects a document pair $\{d, d'\}$, and the factor $\eta_c$ connects word topic labels $z_{w,d}$ and $z_{\cdot,d'}$ of the document pair. Besides encoding the topic smoothness, RTM explicitly describes the topic structural dependencies between the pair of linked documents $\{d, d'\}$ using the factor function $f_{\eta_c}(\cdot)$.

4.1 Inference and Parameter Estimation 

In Fig. 7B, the messages from the factors $\theta_d$ and $\phi_w$ to the variable $z_{w,d}$ are the same as LDA in (15) and (16). Thus, we only need to derive the message passing equation from the factor $\eta_c$ to the variable $z_{w,d}$.

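We design the factor function $f_{\eta_c}(\cdot)$ for linked documents as follows,

$$f_{\eta_c}(k \mid k') = \frac{\sum_{\{d,d'\}} \mu_{\cdot,d}(k)\,\mu_{\cdot,d'}(k')}{\sum_{\{d,d'\},k'} \mu_{\cdot,d}(k)\,\mu_{\cdot,d'}(k')}, \qquad (28)$$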



which depicts the likelihood of topic label $k$ assigned to the document $d$ when its linked document $d'$ is associated with the topic label $k'$. Notice that the designed factor function does not follow the GLM for link modeling in the original RTM [4] because the GLM makes inference slightly more complicated. However, similar to the GLM, Eq. (28) is also able to capture the topic interactions between two linked documents $\{d, d'\}$ in document networks. Instead of the smoothness prior encoded by factor functions (19) and (20), it describes arbitrary topic dependencies $\{k, k'\}$ of linked documents $\{d, d'\}$.
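To make the link factor concrete, the sketch below accumulates topic co-occurrences $\mu_{\cdot,d}(k)\,\mu_{\cdot,d'}(k')$ over linked document pairs and normalizes each row over $k'$, which is how we read the normalization in (28). The function name, the treatment of each link as an ordered pair, and the toy numbers are assumptions for illustration.

```python
def link_factor(doc_msgs, links, K):
    """Return f[k][kp]: topic co-occurrence over linked pairs, normalized over kp."""
    f = [[0.0] * K for _ in range(K)]
    for d, dp in links:
        for k in range(K):
            for kp in range(K):
                f[k][kp] += doc_msgs[d][k] * doc_msgs[dp][kp]
    for k in range(K):
        row_sum = sum(f[k])
        f[k] = [v / row_sum for v in f[k]]
    return f


# Hypothetical document-level messages mu_{.,d}(k) for three documents, K = 2,
# and two links (0, 1) and (1, 2).
doc_msgs = [[3.0, 1.0], [2.0, 2.0], [0.5, 3.5]]
links = [(0, 1), (1, 2)]
f = link_factor(doc_msgs, links, K=2)
```

Each row of `f` is a distribution over the topic $k'$ of the linked document, so off-diagonal mass expresses a dependency between two different topics rather than pure smoothness.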

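Based on the factor function (28), we resort to the sum-product algorithm to calculate the message,

$$\mu_{\eta_c \to z_{w,d}}(k) \propto \sum_{d'} \sum_{k'} f_{\eta_c}(k \mid k')\,\mu_{\cdot,d'}(k'), \qquad (29)$$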



where we use the sum rather than the product of messages from all linked documents $d'$ to avoid arithmetic underflow. The standard sum-product algorithm requires the product of all messages from factors to variables. However, in practice, the direct product operation cannot balance the messages from different sources. For example, the message $\mu_{\theta_d \to z_{w,d}}$ is from the neighboring words within the same document $d$, while the message $\mu_{\eta_c \to z_{w,d}}$ is from all linked documents $d'$. If we pass the product of these two types of messages, we cannot distinguish which one has more influence on the topic label $z_{w,d}$. Hence, we use the weighted sum of the two types of messages,



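$$\mu(z_{w,d} = k) \propto [(1 - \xi)\,\mu_{\theta_d \to z_{w,d}}(k) + \xi\,\mu_{\eta_c \to z_{w,d}}(k)] \times \mu_{\phi_w \to z_{w,d}}(k), \qquad (30)$$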



where $\xi \in [0, 1]$ is the weight to balance the two messages $\mu_{\theta_d \to z_{w,d}}$ and $\mu_{\eta_c \to z_{w,d}}$. When there is no link information, $\xi = 0$ and Eq. (30) reduces to the LDA message update, so that RTM reduces to LDA. Fig. 8 shows the synchronous BP algorithm for learning RTM. Given the inferred messages, the parameter estimation equations remain the same as (12) and (13). Over $T$ iterations, the computational complexity is $\mathcal{O}(K^2 C D W_d T)$, where $C$ is the total number of links in the document network.
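The weighted combination in (30) is easy to check numerically. The following sketch takes three hypothetical $K$-tuple messages and a weight $\xi$; the function name and the message values are assumptions for illustration, while $\xi = 0.15$ is the value reported later in the experiments.

```python
def combine_messages(mu_theta, mu_eta, mu_phi, xi):
    """Weighted sum of document and link messages times the word message, Eq. (30)."""
    scores = [((1.0 - xi) * t + xi * e) * p
              for t, e, p in zip(mu_theta, mu_eta, mu_phi)]
    z = sum(scores)
    return [s / z for s in scores]


# Hypothetical messages for one word with K = 2 topics.
mu_theta = [0.6, 0.4]   # from the neighboring words in the same document
mu_eta = [0.1, 0.9]     # from the linked documents
mu_phi = [0.5, 0.5]     # from the same word in other documents

lda_like = combine_messages(mu_theta, mu_eta, mu_phi, xi=0.0)
balanced = combine_messages(mu_theta, mu_eta, mu_phi, xi=0.15)
```

With `xi=0.0` the link message drops out and the result is the normalized product of the two LDA messages; increasing `xi` shifts the result toward the topics favored by the linked documents.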
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TABLE 2 

Summary of the four document data sets

Data sets      D      A      W      C    N_d   W_d
CORA        2410   2480   2961   8651     57    43
MEDL        2317   8906   8918   1168    104    66
NIPS        1740   2037  13649      -   1323   536
BLOG        5177      -  33574   1549    217   149



5 Experiments 

We use four large-scale document data sets: 

1) CORA [33] contains abstracts from the CORA research paper search engine in the machine learning area, where the documents can be classified into 7 major categories.

2) MEDL [34] contains abstracts from the MEDLINE biomedical paper search engine, where the documents fall broadly into 4 categories.

3) NIPS [35] includes papers from the conference "Neural Information Processing Systems", where all papers are grouped into 13 categories. NIPS has no citation link information.

4) BLOG [36] contains a collection of political blogs on the subject of American politics in the year 2008, where all blogs can be broadly classified into 6 categories. BLOG has no author information.

Table 2 summarizes the statistics of the four data sets, where $D$ is the total number of documents, $A$ is the total number of authors, $W$ is the vocabulary size, $C$ is the total number of links between documents, $N_d$ is the average number of words per document, and $W_d$ is the average vocabulary size per document.

5.1 BP for LDA 

We compare BP with two commonly used LDA learning algorithms, VB [1] (here we use Blei's implementation of digamma functions) and GS [2], under the same fixed hyperparameters $\alpha = \beta = 0.01$. We use MATLAB C/C++ MEX implementations for all these algorithms, and carry out experiments on a common PC with a 2.4 GHz CPU and 4 GB of RAM. For repeatability, we have made our source code and data sets publicly available [37].

To examine the convergence property of BP, we use the entire data set as the training set, and calculate the training perplexity [1] at every 10 iterations over a total of 1000 training iterations from the same initialization. Fig. 9 shows that the training perplexity of BP generally decreases rapidly as the number of training iterations increases. In our experiments, BP on average converges after $T \approx 170$ training iterations, where convergence means that the difference of training perplexity between two successive iterations is less than one. Although this paper does not theoretically prove that BP will converge to a fixed point, the resemblance among VB, GS and BP in subsection 2.5 suggests that a similar underlying principle makes BP converge on the general sparse word vector space of real-world applications. Further analysis reveals that BP on average uses more training iterations until convergence than VB ($T \approx 100$) but far fewer training iterations than GS ($T \approx 300$) on the four data sets. A fast convergence rate is a desirable property as far as online [21] and distributed [38] topic modeling for large-scale corpora are concerned.
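The stopping rule described above can be written as a short check on a perplexity trace. This is only a sketch: the function name and the trace values are hypothetical, and the threshold of one is the criterion stated in the text.

```python
def first_converged_iteration(perplexities, tol=1.0):
    """Return the 1-based iteration where |P_t - P_{t-1}| first falls below tol."""
    for t in range(1, len(perplexities)):
        if abs(perplexities[t - 1] - perplexities[t]) < tol:
            return t + 1
    return None  # the trace never met the criterion


# Hypothetical training perplexity recorded at successive iterations.
trace = [1000.0, 900.0, 850.0, 849.5, 849.4]
converged_at = first_converged_iteration(trace)
```

The same function applies unchanged to a trace recorded every 10 iterations; the returned index then counts recorded checkpoints rather than single iterations.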

The predictive perplexity for the unseen test set is computed as follows [1], [7]. To ensure that all algorithms achieve a local optimum, we use 1000 training iterations to estimate $\phi$ on the training set from the same initialization. In practice, this number of training iterations is large enough for convergence of all algorithms in Fig. 9. We randomly partition each document in the test set into 90% and 10% subsets. We use 1000 iterations of the learning algorithms to estimate $\theta$ from the same initialization while holding $\phi$ fixed on the 90% subset, and then calculate the predictive perplexity on the remaining 10% subset:



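$$\mathcal{P} = \exp\left\{-\frac{\sum_{w,d} x^{10\%}_{w,d} \log\left[\sum_k \theta_d(k)\,\phi_w(k)\right]}{\sum_{w,d} x^{10\%}_{w,d}}\right\}, \qquad (31)$$

where $x^{10\%}_{w,d}$ denotes word counts in the 10% subset. Notice that the perplexity (31) is based on the marginal probability of the word topic label $\mu_{w,d}(k)$ in (21).

Blei's LDA implementation: http://www.cs.princeton.edu/~blei/lda-c/index.html

MATLAB topic modeling toolbox: http://psiexp.ss.uci.edu/research/programs_data/toolbox.html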
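As a worked example of the held-out perplexity, the sketch below evaluates the standard form $\exp\{-\sum x \log[\sum_k \theta_d(k)\phi_w(k)] / \sum x\}$ on a toy test document. The data layout (`theta[d][k]`, `phi[k][w]`, a count dictionary per document) and the numbers are assumptions chosen so that the result is easy to verify by hand.

```python
import math


def predictive_perplexity(heldout, theta, phi):
    """Perplexity of held-out counts under fixed theta[d][k] and phi[k][w]."""
    log_lik = 0.0
    total = 0
    for d, counts in enumerate(heldout):
        for w, x in counts.items():
            p = sum(theta[d][k] * phi[k][w] for k in range(len(theta[d])))
            log_lik += x * math.log(p)
            total += x
    return math.exp(-log_lik / total)


# One test document, K = 2 topics, W = 4 words, both topics uniform over words.
theta = [[0.5, 0.5]]
phi = [[0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25]]
heldout = [{0: 2, 3: 1}]   # word id -> count in the 10% subset
perplexity = predictive_perplexity(heldout, theta, phi)
```

With uniform topics every held-out word has probability 1/4, so the perplexity equals the vocabulary size 4; a model that concentrates mass on the observed words scores lower.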

Fig. 10 shows the predictive perplexity (average ± standard deviation) from five-fold cross-validation for different numbers of topics, where a lower perplexity indicates a better generalization ability for the unseen test set. BP consistently has the lowest perplexity for different numbers of topics on the four data sets, which confirms its effectiveness for learning LDA. On average, the perplexity of BP is around 11% lower than that of VB and 6% lower than that of GS. Fig. 11 shows that BP uses less training time than both VB and GS. For VB, we plot only 0.3 times its real training time, because its digamma functions are time-consuming. In fact, VB runs as fast as BP if we remove the digamma functions, so we believe that it is the digamma functions that slow down VB in learning LDA. BP is faster than GS because it computes messages for word indices. The speed difference is largest on the NIPS set due to its largest ratio $N_d/W_d = 2.47$ in Table 2. Although VB converges rapidly, which we attribute to the digamma functions, it often consumes about triple the training time. Therefore, BP on average enjoys the highest efficiency for learning LDA with regard to the balance of convergence rate and training time.
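The ratio $N_d/W_d$ quoted above can be recomputed directly from the last two columns of Table 2. The snippet below is a small arithmetic check; the dictionary name is hypothetical and the numbers are copied from the table.

```python
# (N_d, W_d) per data set, copied from Table 2.
table2 = {
    "CORA": (57, 43),
    "MEDL": (104, 66),
    "NIPS": (1323, 536),
    "BLOG": (217, 149),
}

# Average tokens per distinct word in a document.
ratios = {name: n_d / w_d for name, (n_d, w_d) in table2.items()}
largest = max(ratios, key=ratios.get)
```

An algorithm that scans word tokens does work proportional to $N_d$ per document, whereas one that scans word indices does work proportional to $W_d$, so the data set with the largest ratio is where the gap is expected to be widest.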

We also compare six BP implementations, obtained by running siBP, BP and CVB0 [7] under both synchronous and asynchronous update schedules. We name the three synchronous implementations s-BP, s-siBP and s-CVB0, and the three asynchronous implementations a-BP, a-siBP and a-CVB0. Because these six belief propagation implementations produce comparable perplexity, Fig. 12 shows the relative perplexity, obtained by subtracting the mean value of the six implementations. Overall, the asynchronous schedule gives slightly lower perplexity than the synchronous schedule because it passes messages faster and more efficiently. Except on the CORA set, siBP generally provides the highest perplexity because it introduces subtle biases in computing messages at each iteration. The biased messages are propagated and accumulated, leading to inaccurate parameter estimation. Although the proposed BP achieves lower perplexity than CVB0 on the NIPS set, both of them work comparably well on the other sets. However, BP is much faster because it computes messages over word indices. The comparable results also confirm our assumption that topic modeling can be efficiently performed on word indices instead of tokens.

Fig. 9. Training perplexity as a function of number of iterations when K = 50 on CORA, MEDL, NIPS and BLOG.

Fig. 10. Predictive perplexity as a function of number of topics on CORA, MEDL, NIPS and BLOG.

Fig. 11. Training time as a function of number of topics on CORA, MEDL, NIPS and BLOG. For VB, it shows 0.3 times the real learning time, denoted by 0.3x.

To measure the interpretability of a topic model, word intrusion and topic intrusion have been proposed as tasks that involve subjective judgements [39]. The basic idea is to ask volunteer subjects to identify the number of word intruders in the topic as well as the topic intruders in the document, where intruders are defined as inconsistent words or topics based on the prior knowledge of the subjects. Fig. 13 shows the top ten words of K = 10 topics inferred by the VB, GS and BP algorithms on the NIPS set. We find no obvious difference with respect to word intrusions in each topic. Most topics share similar top ten words but with different ranking orders. Despite the significant perplexity difference, the topics extracted by the three algorithms retain almost the same interpretability, at least for the top ten words. This result agrees with [39] in that lower perplexity may not enhance the interpretability of inferred topics.

A similar phenomenon has also been observed in MRF-based image labeling problems [20]. Different MRF inference algorithms such as graph cuts and BP often yield comparable results. Although one inference method may find better MRF solutions, this does not necessarily translate into better performance compared to the ground truth. The underlying hypothesis is that the ground-truth labeling configuration is often less optimal than the solutions produced by inference algorithms. For example, if we manually label the topics for a corpus, the final perplexity is often higher than that of solutions returned by VB, GS and BP. For each document, LDA provides an equal number of topics K, but the ground truth often uses an unequal number of topics to explain the observed words, which may be another reason why the overall perplexity of learned LDA is often lower than that of the ground truth. To test this hypothesis, we compare the perplexity of labeled LDA (L-LDA) [40] with LDA in Fig. 14. L-LDA is a supervised LDA that restricts the hidden topics to the observed class labels of each document. When a document has multiple class labels, L-LDA automatically assigns one of the class labels to each word index. In this way, L-LDA resembles the process of manual topic labeling by humans, and its solution can be viewed as close to the ground truth. For a fair comparison, we set the number of topics K = 7, 4, 13, 6 of LDA for CORA, MEDL, NIPS and BLOG according to the number of document categories in each set. Both L-LDA and LDA are trained by BP using 500 iterations from the same initialization. Fig. 14 confirms that L-LDA produces higher perplexity than LDA, which partly supports the hypothesis that the ground truth often yields higher perplexity than the optimal solutions of LDA inferred by BP. The underlying reason may be that the three topic modeling rules encoded by LDA are still too simple to capture human behaviors in finding topics.

Fig. 12. Relative predictive perplexity as a function of number of topics on CORA, MEDL, NIPS and BLOG.

Fig. 13. Top ten words of K = 10 topics of VB (first line), GS (second line), and BP (third line) on NIPS.

In this situation, improving the formulation of topic models such as LDA is a better way to significantly enhance topic modeling performance than improving inference algorithms. Although proper settings of hyperparameters can make the predictive perplexity comparable for all state-of-the-art approximate inference algorithms [7], we still advocate BP because it is faster and more accurate than both VB and GS under the fixed settings used here, even if all three can provide comparable perplexity and interpretability under proper hyperparameter settings.

5.2 BP for ATM 

The GS algorithm for learning ATM is implemented in the MATLAB topic modeling toolbox. We compare BP and GS for learning ATM based on 500 iterations on training data. Fig. 15 shows the predictive perplexity (average ± standard deviation) from five-fold cross-validation. On average, the perplexity of BP is 12% lower than that of GS, which is consistent with Fig. 10. Another possible reason for such improvements may be our assumption that all coauthors of the document account for the word topic label using multinomial instead of uniform probabilities.

5.3 BP for RTM 

The GS algorithm for learning RTM is implemented in the R package lda. We compare BP with GS for learning RTM using the same 500 iterations on the training data set. Based on the training perplexity, we manually set the weight $\xi = 0.15$ in Fig. 8 to achieve the overall superior performance on the four data sets.

Fig. 16 shows the predictive perplexity (average ± standard deviation) from five-fold cross-validation. On average, the perplexity of BP is 6% lower than that of GS. Because the original RTM learned by GS is inflexible in balancing information from different sources, it has slightly higher perplexity than LDA (Fig. 10). To circumvent this problem, we introduce the weight $\xi$ in (30) to balance the two types of messages, so that the learned RTM gains lower perplexity than LDA. Future work will estimate the balancing weight $\xi$ based on feature selection or MRF structure learning techniques.

MATLAB topic modeling toolbox: http://psiexp.ss.uci.edu/research/programs_data/toolbox.html

R package lda: http://cran.r-project.org/web/packages/lda/

We also examine the link prediction performance of RTM. We define link prediction as a binary classification problem. As with [4], we use the Hadamard product of a pair of document topic proportions as the link feature, and train an SVM [41] to decide if there is a link between them. Notice that the original RTM [4] learned by the GS algorithm uses the GLM to predict links. Fig. 17 compares the F-measure (average ± standard deviation) of link prediction from five-fold cross-validation. Encouragingly, BP provides a 15% higher F-measure than GS on average. These results confirm the effectiveness of BP for capturing accurate topic structural dependencies in document networks.
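The link feature and the evaluation measure used above can be sketched as follows. The classifier itself is omitted; the function names are hypothetical, the topic proportions and labels are toy values, and the F-measure is assumed here to be the balanced F1 score (harmonic mean of precision and recall).

```python
def hadamard(theta_d, theta_dp):
    """Element-wise product of two document topic proportions (the link feature)."""
    return [a * b for a, b in zip(theta_d, theta_dp)]


def f_measure(y_true, y_pred):
    """Balanced F1 score for binary link labels (1 = link, 0 = no link)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical topic proportions of two documents, K = 2.
feature = hadamard([0.5, 0.5], [0.2, 0.8])

# Hypothetical true and predicted link labels for four document pairs.
score = f_measure([1, 1, 0, 0], [1, 0, 1, 0])
```

The feature is large only in the topics that both documents use heavily, which is what lets a linear classifier relate shared topics to the presence of a link.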

6 Conclusions 

First, this paper has presented a novel factor graph representation of LDA within the MRF framework. Not only does the MRF formulation solve topic modeling as a labeling problem, but it also facilitates BP algorithms for approximate inference and parameter estimation in three steps:

1) First, we absorb {w, d} as indices of factors, which 
connect hidden variables such as topic labels in the 
neighborhood system. 

2) Second, we design the proper factor functions to 
encourage or penalize different local topic labeling 
configurations in the neighborhood system. 

3) Third, we develop the approximate inference and 
parameter estimation algorithms within the mes- 
sage passing framework. 

The BP algorithm is easy to implement and computationally efficient, and it is faster and more accurate than the two other approximate inference methods, VB [1] and GS [2], in several topic modeling tasks of broad interest. Furthermore, the superior performance of the BP algorithm for learning ATM [3] and RTM [4] confirms its potential effectiveness in learning other LDA extensions.

Second, as the main contribution of this paper, the proper definition of neighborhood systems as well as the design of factor functions can interpret the three-layer LDA by the two-layer MRF in the sense that they encode the same joint probability. Since probabilistic topic modeling is essentially a word annotation paradigm, the MRF perspective opened here may inspire us to use other MRF-based image segmentation [22] or data clustering algorithms [17] for LDA-based topic models.

Finally, the scalability of BP is an important issue in our future work. As with VB and GS, the complexity of the BP algorithm is linear in the number of documents $D$ and the number of topics $K$. We may extend the proposed BP algorithm for online [21] and distributed [38] learning of LDA, where the former incrementally learns parts of the $D$ documents in data streams and the latter learns parts of the $D$ documents on distributed computing units. Since the $K$-tuple message is often sparse [42], we may also pass only the salient parts of the $K$-tuple messages or only update the informative parts of messages at each learning iteration to speed up the whole message passing process.

Fig. 15. Predictive perplexity as a function of number of topics for ATM on CORA, MEDL and NIPS.

Fig. 16. Predictive perplexity as a function of number of topics for RTM on CORA, MEDL and BLOG.

Fig. 17. F-measure of link prediction as a function of number of topics on CORA, MEDL and BLOG.
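One simple way to pass only the salient parts of a sparse $K$-tuple message is to keep its largest entries and renormalize them. The sketch below is an illustrative assumption about how such a truncation could look, not a method evaluated in this paper; the function name, the cutoff `m` and the message values are hypothetical.

```python
def keep_salient(message, m):
    """Keep the m largest entries of a K-tuple message and renormalize them."""
    top = sorted(range(len(message)), key=lambda k: message[k], reverse=True)[:m]
    z = sum(message[k] for k in top)
    return {k: message[k] / z for k in top}


# Hypothetical sparse message over K = 4 topics.
message = [0.05, 0.60, 0.05, 0.30]
salient = keep_salient(message, m=2)
```

Storing the result as a dictionary from topic index to value means that later updates touch only `m` entries instead of all $K$, which is where a speedup would come from.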
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